Fast.ai Computational Linear Algebra Lecture 4
WNixalo - 2018/3/7
Using the real Video_003 dataset from the BMC 2012 Background Models Challenge.
Other sources of datasets:
Imports:
In [4]:
import moviepy.editor as mpe
# from IPython.display import display
from glob import glob
import sys, os
import numpy as np
import scipy
import matplotlib.pyplot as plt
%matplotlib inline
In [5]:
# MAX_ITERS = 10
TOL = 1e-8
In [8]:
video = mpe.VideoFileClip("data/Video_003/Video_003.avi")
In [11]:
video.subclip(0,50).ipython_display(width=300)
Out[11]:
In [12]:
video.duration # seconds
Out[12]:
In [63]:
def create_data_matrix_from_video(clip, k=5, scale=50):
    return np.vstack([scipy.misc.imresize(rgb2gray(clip.get_frame(i/float(k))).astype(int),
                      scale).flatten() for i in range(k * int(clip.duration))]).T

def rgb2gray(rgb):
    return np.dot(rgb[...,:3], [0.299, 0.587, 0.114])

def plt_images(M, A, E, index_array, dims, filename=None):
    f = plt.figure(figsize=(15, 10))
    r = len(index_array)
    pics = r*3
    for k, i in enumerate(index_array):
        for j, mat in enumerate([M, A, E]):
            sp = f.add_subplot(r, 3, 3*k + j + 1)
            sp.axis('Off')
            pixels = mat[:,i]
            if isinstance(pixels, scipy.sparse.csr_matrix):
                pixels = pixels.todense()
            plt.imshow(np.reshape(pixels, dims), cmap='gray')
    return f

def plots(ims, dims, figsize=(15,20), rows=1, interp=False, titles=None):
    if type(ims[0]) is np.ndarray:
        ims = np.array(ims)
    f = plt.figure(figsize=figsize)
    for i in range(len(ims)):
        sp = f.add_subplot(rows, len(ims)//rows, i+1)
        sp.axis('Off')
        plt.imshow(np.reshape(ims[i], dims), cmap="gray")
We're scaling each image (one moment in time) down to 60 x 80 pixels. We can unroll that picture into a single tall column: instead of a 2D 60x80 picture, we have a 4800 x 1 column.
This isn't very human-readable, but it's useful because it lets us line up images from different times side by side as columns, putting them all into one matrix. If we took a frame every 1/100th of a second for 113 s (11,300 images through time), we'd have a 4800 x 11300 matrix representing the video.
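Just to sanity-check the shape bookkeeping, here's a tiny sketch of my own (random arrays standing in for grayscale frames, not part of the course code):
In [ ]:
# Toy version of the unroll-and-stack step: each column of the matrix is one flattened frame.
import numpy as np
n_frames = 5
frames = [np.random.rand(60, 80) for _ in range(n_frames)]   # stand-ins for 60x80 grayscale frames
M_toy = np.vstack([f.flatten() for f in frames]).T           # stack the unrolled frames as columns
print(M_toy.shape)                                           # (4800, 5)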
In [17]:
scale = 25 # Adjust scale to change resolution of image. eg 25% scale here.
dims = (int(240*(scale/100)), int(320*(scale/100))) # orig res: 320x240 (x,y)
In [18]:
M = create_data_matrix_from_video(video, 100, scale) # so k is ~fps
# M = np.load("lores_surveillance_matrix.npy")
In [19]:
print(dims, M.shape)
In [21]:
# display 141st scaled frame
plt.imshow(np.reshape(M[:,140], dims), cmap='gray');
Note: the course notebook image is 'hi-res' because R. Thomas did this first at full resolution (although that'd be very slow here).
Since create_data_matrix_from_video is somewhat slow, we'll save our matrix. In general, whenever you have slow pre-processing steps, it's a good idea to save the results for future use.
In [24]:
np.save("lores_surveillance_matrix.npy", M)
NOTE: only run the below 2 lines on the lo-res (25%) version
In [25]:
plt.figure(figsize=(12,12))
plt.imshow(M, cmap='gray')
Out[25]:
That's what the whole video looks like. I'm gonna throw a wild guess that those lines (the non-horizontal ones) are correlated with the movements in the video.
Oooo I was right.
Questions: What are those wavy black lines? What are the horizontal lines?
$\longrightarrow$ The wavy lines are the people moving in the video. The horizontal lines are the background.
Each column is a frame / moment in time. So something like a stop sign will just stay static if the camera doesn't move.
The horizontal lines tell us that this matrix is low-rank. Rank is the number of linearly independent columns $\rightarrow$ and that always turns out to equal the number of linearly independent rows.
$\longrightarrow$ Put another way: rank is the dimension of the column space or row space (which is the number of linearly independent columns or rows).
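Quick numerical check of that fact on a made-up rank-2 matrix (my own toy cell, not from the course notebook):
In [ ]:
# Column rank equals row rank: the rank of A and of A.T agree.
A = np.outer(np.arange(1, 7), np.ones(4)) + np.outer(np.ones(6), np.arange(4))  # a rank-2 matrix
print(np.linalg.matrix_rank(A), np.linalg.matrix_rank(A.T))   # 2 2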
Review from Lesson 2: (Unit 2)
In [26]:
from sklearn import decomposition
In [27]:
u, s, v = decomposition.randomized_svd(M, 2)
# 2nd arg is num components, ie singular values, you want back. Trial&Error.
In [28]:
low_rank = u @ np.diag(s) @ v
USV is low-rank because even though U has a ton of nonzero values, S is just 2 singular values (the components we told the truncated SVD we wanted). Likewise V (or V$^T$) is 2 x ton-of-values. So when you multiply USV, there will only be 2 linearly independent columns.
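Quick sanity check that the reconstruction really is rank 2 (my own cell; I only check a random subset of columns to keep the SVD cheap):
In [ ]:
# Any subset of low_rank's columns spans at most a 2-dimensional space.
cols = np.random.choice(low_rank.shape[1], 50, replace=False)
print(np.linalg.matrix_rank(low_rank[:, cols]))   # 2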
Linear Algebra is trippy and awesome. No wonder it makes people freak out about AIpocalypses the moment you machine-power it.
In [33]:
u.shape, s.shape, v.shape
Out[33]:
NB: the Σ/σ vs S/s notation gets mixed up across sources. R. Thomas uses s for the singular values (the diagonal of Σ); the FB research post defines U = QS with Σ as the singular-value matrix, while U. Lisboa writes Σ as S. Anyway, I've got it now.
In [30]:
low_rank.shape
Out[30]:
In [31]:
plt.imshow(np.reshape(low_rank[:,140], dims), cmap='gray')
Out[31]:
We can take a look at our low-rank reconstructed matrix:
In [35]:
plt.figure(figsize=(12,12))
plt.imshow(low_rank, cmap='gray');
You can think of this as a data-compression problem. With just rank-2 (2 cols in U, 2 σs, 2 rows in V$^T$), we get this much of our original matrix back.
Now what if we just want to get the people back out of the video, instead of the background? That's as simple as subtracting:
In [39]:
plt.imshow(np.reshape(M[:,140] - low_rank[:,140], dims), cmap='gray');
Zooming in (this is better with the hi-res version):
In [42]:
plt.imshow(np.reshape(M[:,140] - low_rank[:,140], dims)[15:40,25:65], cmap='gray');
Okay. That's really cool that we can do so much with such a simple algorithm. Now how much better could we do with a more sophisticated algorithm?
When dealing with high-dimensional datasets, we often leverage the fact that the data has low intrinsic dimensionality in order to alleviate the curse of dimensionality and scale (maybe the data lies in a low-dimensional subspace or on a low-dimensional manifold). Principal Component Analysis is useful for eliminating dimensions. Classical PCA seeks the best rank-k estimate L of M (minimizing $\lVert M - L \rVert$ where L has rank k).
Traditional PCA can handle small noise, but is brittle with respect to grossly corrupted observations -- even one grossly corrupted observation can significantly mess up the answer.
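For reference, the classical best rank-k estimate is just the truncated SVD (Eckart-Young). A minimal sketch of my own on a made-up matrix (not the surveillance data):
In [ ]:
# Best rank-k approximation (in Frobenius norm) via truncated SVD.
X = np.random.rand(100, 80)   # toy data
k = 2
U, sig, Vt = np.linalg.svd(X, full_matrices=False)
L_k = U[:, :k] @ np.diag(sig[:k]) @ Vt[:k]   # classical rank-k estimate
print(np.linalg.norm(X - L_k, 'fro'))        # the quantity classical PCA minimizes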
Robust PCA factors a matrix into the sum of two matrices, M = L + S, where M is the original matrix, L is low-rank, and S is sparse. This is what we'll be using for background removal.
Low-rank means the matrix has lots of redundant (linearly dependent) information -- in this case it's the background, which is identical in every frame.
Sparse means the matrix has mostly zero entries -- in this case, see how the foreground picture (of the people) is mostly empty. In the case of corrupted data, S captures the corrupted entries.
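To make M = L + S concrete, here's a toy construction of my own (a rank-1 "background" plus a few large "foreground" entries; none of these names come from the course code):
In [ ]:
# Toy M = L (low-rank) + S (sparse), mimicking background + foreground.
rng = np.random.RandomState(0)
L_true = np.outer(rng.rand(50), rng.rand(40))   # rank-1 "background"
S_true = np.zeros((50, 40))
S_true[rng.rand(50, 40) < 0.05] = 5.0           # a few large "foreground" entries
M_demo = L_true + S_true
print(np.linalg.matrix_rank(L_true), (S_true != 0).mean())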
When working with a high-dimensional dataset, you want to use the fact that the data usually has low intrinsic dimensionality -- i.e. not all its dimensions carry information. Other applications of this L + S decomposition:
Latent Semantic Indexing: $L$ captures common words used in all documents, while $S$ captures the few key words that best distinguish each document from the others.
Ranking and Collaborative Filtering: a small portion of the available rankings could be noisy or even tampered with (see Netflix RAD - Outlier Detection on Big Data).
The unit ball $\lVert x \rVert_1 = 1$ is a diamond in the L1 norm. Its extrema are the corners:
A similar perspective is to look at the contours of the loss function:
basically:
The L1 norm wants to keep the number of nonzero terms to a minimum, even if some weights get larger. The L2 norm wants to minimize the sum of squares, so it wants everything as small as possible $\Longrightarrow$ more nonzero terms.
$\lVert x \rVert_1 = \sum_i \lvert x_i \rvert \ ; \quad \lVert x \rVert_2 = \left(\sum_i x_i^2\right)^{1/2}$
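A quick illustration with two made-up vectors: a sparse vector and a dense vector can have the same L2 norm but very different L1 norms, which is why the L1 penalty pushes toward sparsity.
In [ ]:
# L1 vs L2: the sparse vector is "cheaper" under L1, equally costly under L2.
sparse_v = np.array([1.0, 0.0, 0.0, 0.0])
dense_v  = np.array([0.5, 0.5, 0.5, 0.5])
print(np.linalg.norm(sparse_v, 1), np.linalg.norm(dense_v, 1))   # 1.0 vs 2.0
print(np.linalg.norm(sparse_v, 2), np.linalg.norm(dense_v, 2))   # 1.0 vs 1.0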
Robust PCA can be written:
$$\text{minimize} \quad \lVert L \rVert_* + \lambda \lVert S \rVert_1$$
$$\text{subject to:} \quad L + S = M$$
where $\lVert \cdot \rVert_*$ is the nuclear norm (the sum of the singular values) and $\lVert \cdot \rVert_1$ is the entrywise L1 norm (the sum of absolute values).
In English: the way we encourage a sparse matrix is by minimizing its L1 norm; a low-rank matrix by minimizing the nuclear norm (which is the L1 norm of the singular values).
$\longrightarrow$ i.e.: if a lot of the singular values are zero, you have a low-rank matrix. This leads into Optimization and the courses taught by Stephen Boyd (linked below).
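The nuclear norm is easy to compute directly from the singular values (a toy check of my own, not part of the course code):
In [ ]:
# Nuclear norm = sum of singular values = L1 norm of the spectrum.
A = np.random.rand(30, 20)
print(np.linalg.svd(A, compute_uv=False).sum(), np.linalg.norm(A, 'nuc'))   # same value, two ways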
We'll use the general Principal Component Pursuit algorithm from this Robust PCA paper, in the specific form of Matrix Completion via the Inexact ALM Method in §3.1 here -- Lin-Chen-Ma. Also referenced implementations here and here.
PCP is one way of doing Robust PCA: decomposing a matrix into the sum of a low-rank and a sparse matrix.
§1: Intro, and §5: Algorithms are our main interests.
You don't need to know the math or understand the proofs to implement an algorithm from a paper. (As of 2017-18) there's a lot of theoretical research and comparatively little pragmatic advice.
Algorithm on p29:
Definitions:
$\mathcal{S}$ : shrinkage operator
$\mathcal{D}$ : singular value thresholding operator
From the paper:
$\longrightarrow$ so the shrinkage operator soft-thresholds each entry: it subtracts a value τ from the absolute value, clamps at zero, and keeps the sign:
$$\mathcal{S}_τ[x] = \mathrm{sign}(x) \cdot \max(\lvert x \rvert - τ, \ 0)$$
The Singular Value Thresholding operator takes the SVD, shrinks the singular values (by applying $\mathcal{S}_τ$ to them), then recomposes the matrix. $\longrightarrow$ It decreases the magnitudes a bit, but most importantly sets some σ's to zero $\rightarrow$ makes the spectrum more sparse.
$\longrightarrow$ similar to doing a truncated SVD: every σ < τ $\rightarrow$ 0,
which makes sense because we just learned that that gives a low-rank matrix.
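Here's a minimal sketch of those two operators exactly as defined above (the names shrinkage and svt are mine; the course code further below implements the same ideas as shrink and svd_reconstruct, using fbpca for speed):
In [ ]:
# Shrinkage (elementwise soft-threshold) and singular value thresholding.
def shrinkage(X, tau):
    # S_tau[x] = sign(x) * max(|x| - tau, 0), applied elementwise
    return np.sign(X) * np.maximum(np.abs(X) - tau, 0)

def svt(X, tau):
    # D_tau(X) = U * S_tau(Sigma) * V^T : shrink the singular values, then recompose
    U, sig, Vt = np.linalg.svd(X, full_matrices=False)
    return U @ np.diag(shrinkage(sig, tau)) @ Vt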
Back to the Algorithm:
In step 3, the low-rank matrix is created by singular value thresholding.
In step 4, the sparse matrix is created by looking at the remaining error: the original matrix $-$ the low-rank matrix (because you want your low-rank & sparse matrices to sum to the original), and the $μ^{-1}Y_k$ term keeps track of what's still left to approximate.
THE BASIC IDEA of all this is to iteratively go back and forth between improving the low-rank matrix estimate and the sparse matrix estimate. And singular value thresholding is very similar to truncating a bunch of singular values.
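To see that back-and-forth in its simplest possible form, here's a stripped-down alternation on the toy M_demo from earlier, using the shrinkage/svt sketches above. This is not the inexact-ALM update schedule with Y and μ that pcp() implements below; the thresholds here are arbitrary and just for illustration.
In [ ]:
# Naive alternation: re-estimate L by SVT, then S by shrinkage, and repeat.
L_est = np.zeros_like(M_demo); S_est = np.zeros_like(M_demo)
for _ in range(20):
    L_est = svt(M_demo - S_est, tau=1.0)         # low-rank update
    S_est = shrinkage(M_demo - L_est, tau=0.2)   # sparse update
print(np.linalg.matrix_rank(L_est), (np.abs(S_est) > 1e-6).mean())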
§3.1 of this paper - Lin-Chen-Ma - has a faster variation of this Algorithm:
ALM: Augmented Lagrange Multipliers
§4 has some helpful implementation details on how many singular values to calculate (and how to choose parameter values):
If the number of singular values above the threshold (svp below) is still less than your current estimate, you just increment by 1; otherwise you jump by a fixed fraction of the matrix dimension (5% of n in the code below, i.e. 240 here) $\rightarrow$ you never calculate many more singular values than you actually need.
We'll use Facebook's Fast Randomized PCA library.
In [44]:
from scipy import sparse
from sklearn.utils.extmath import randomized_svd
import fbpca
In [45]:
TOL = 1e-9
MAX_ITERS = 3
In [151]:
def converged(Z, d_norm):
    err = np.linalg.norm(Z, 'fro') / d_norm   # 'fro': Frobenius norm
    print('error: ', err)
    return err < TOL

# the shrinkage operator
def shrink(M, tau):
    S = np.abs(M) - tau
    return np.sign(M) * np.where(S > 0, S, 0)

# fast randomized SVD implementation from Facebook -- github.com/facebook/fbpca
# NOTE: this is SVD, but Facebook calls it PCA..
def _svd(M, rank):
    return fbpca.pca(M, k=min(rank, np.min(M.shape)), raw=True)

def norm_op(M):
    return _svd(M, 1)[1][0]

def svd_reconstruct(M, rank, min_sv):
    u, s, v = _svd(M, rank)   # take SVD
    s -= min_sv               # subtract min_sv off the singular values
    nnz = (s > 0).sum()
    return u[:,:nnz] @ np.diag(s[:nnz]) @ v[:nnz], nnz   # reconstruct

def pcp(X, maxiter=10, k=10):   # refactored
    m, n = X.shape
    trans = m < n                          # transpose the matrix:
    if trans: X = X.T; m, n = X.shape      # you want it to be tall & skinny

    lamda = 1/np.sqrt(m)
    op_norm = norm_op(X)
    Y = np.copy(X) / max(op_norm, np.linalg.norm(X, np.inf) / lamda)
    μ = k*1.25/op_norm; μ_bar = μ * 1e7; rho = k * 1.5

    d_norm = np.linalg.norm(X, 'fro')
    L = np.zeros_like(X); sv = 1

    examples = []

    for i in range(maxiter):
        print("rank sv:", sv)
        X2 = X + Y/μ

        # update estimate of Sparse Matrix by "shrinking/truncating": original - low-rank
        S = shrink(X2 - L, lamda/μ)

        # update estimate of Low-Rank Matrix by doing truncated SVD of rank sv & reconstructing.
        # count of singular values > 1/μ is returned as svp
        L, svp = svd_reconstruct(X2 - S, sv, 1/μ)

        # If svp < sv, you're already calculating enough singular values.
        # If not, add 5% of n (in this case 240) to sv
        sv = svp + (1 if svp < sv else round(0.05*n))

        # residual
        Z = X - L - S

        Y += μ*Z; μ *= rho

        examples.extend([S[140,:], L[140,:]])

        if μ > μ_bar: μ = μ_bar   # cap μ at μ_bar
        if converged(Z, d_norm): break

    if trans: L = L.T; S = S.T
    return L, S, examples
SVD & PCA are different but closely related: PCA is typically computed via the SVD. PCA usually subtracts the mean first (and classically works with the data matrix multiplied by its transpose, i.e. the covariance).
Careful: subtracting the mean from a sparse matrix makes it no longer sparse.
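A minimal sketch of that relationship on toy data of my own: PCA is the SVD of the mean-centered data matrix.
In [ ]:
# PCA via SVD: center the data, then the right singular vectors are the principal components.
X = np.random.rand(200, 10)                 # 200 samples, 10 features (toy data)
Xc = X - X.mean(axis=0)                     # subtract the mean -- the step that destroys sparsity
U, sig, Vt = np.linalg.svd(Xc, full_matrices=False)
components = Vt[:2]                         # top-2 principal directions
explained_var = sig[:2]**2 / (len(X) - 1)   # variance explained by each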
In [147]:
L, S, examples = pcp(M, maxiter=5, k=10)
In [148]:
plots(examples, dims, rows=5)
In [149]:
f = plt_images(M, S, L, [0, 140, 1000], dims)
NOTE: I was having a huge debugging issue here with my code not reproducing the results from the course notebook. After a couple of hours I found out I was setting μ += rho instead of μ *= rho ....
In [150]:
# np.save("lores_L.npy", L)
# np.save("lores_S.npy", S)
Extracting a bit of the foreground is easier than identifying the background. To accurately get the background, you need to remove the entire foreground, not just parts of it.